hpuxcomp.txt | Text File | 1993-03-29
The Power of HP-UX Compiler Optimizations
1.0 Abstract
Any code compiled without optimization may be missing a great opportunity for
performance improvement. A study of the effect of compiler optimization on the
benchmarks in the SPEC CINT92 suite shows that, compared to unoptimized code,
an easy 15 to 25% gain is available, and performance gains of 25 to 60% may be
available to those who utilize more of the optimizer. Note that the benchmarks
studied are not highly tunable, easily optimized floating-point benchmarks;
they are C-language, integer benchmarks which are much like many of the HP-UX
applications.
2.0 Executive Summary
While the exact effect varied from benchmark to benchmark, just turning on
Level One (+O1), which should be safe on virtually all code, generally gained
about 20% faster execution. Using Level Two (-O or +O2), the standard level of
optimization, generally gained up to a 50% performance increase. And when
every available feature was utilized, the gains were usually in the 40 - 60%
range, with one benchmark that did better than a two-fold increase in
performance.
It is well worth investigating the use of optimization for any code compiled
on an HP-UX 9.0 or later system, even if optimization was not considered
important for earlier releases. The 9.0 and later compilers trade simpler
code generation for faster compile and link times when no optimization is
specified. Thus with 9.0, if no optimization is used, it is very likely that
there will be a 10% performance drop even before any kernel or system impacts
are considered.
3.0 Observations
The use of just Level One optimization is virtually a free 20% performance
gain over not using any optimization. Level One optimization applies only the
simple, in-line, statement-level, safe optimizations which can be used by
just about any piece of code. Still, these easy optimizations provided between
15 and 25% improvement in most cases in the CINT92 suite.
Level Two optimization, the standard level of optimization (-O), offers another
performance gain ranging from 5 - 10% for "branchy" system-type code to
20 - 30% for more "loopy" analytic-style code. There are benefits for almost
all programs in Level Two optimization, including a nice reduction in the
overhead of procedure calling, but the big gains will be in those pieces of
code which spend a fair amount of time in a few key loops. For the one case,
out of the six in CINT92, that best exemplifies this "loopy" behavior, Level
Two brought almost a two-fold performance improvement over Level One.
At this point, there was only a small gain from attempting Level Three
optimizations on the examples in the CINT92 suite. There seems to be less
advantage in the advanced optimization methods for this kind of system
application C code. However, Level Three did have a significant impact on the
results of CFP92, the floating-point suite, whose benchmarks are much more
analytic in nature.
However, there is another optimization that is well worth considering. The use
of archive libraries, rather than the default shared libraries, can make a
noticeable difference. Even code which rarely made any library calls showed a
detectable 1 - 2% difference when linked shared. But code which makes a high
number of calls to library routines may see as much as a 20% improvement with
the use of archived executables rather than dynamically linked shared
libraries.
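As a sketch, linking the same object file both ways might look like the
Makefile fragment below. The `-Wl,-a,archive` spelling is an assumption about
the HP-UX linker's archive/shared library selector, so check ld(1) on the
target release before relying on it.

```make
# Hypothetical Makefile fragment; flag spellings are assumptions
# about the HP-UX 9.0 toolchain, not verified against it.
CFLAGS = +O2

prog_shared: prog.o
	cc -o prog_shared prog.o                  # default: shared libraries

prog_archive: prog.o
	cc -o prog_archive prog.o -Wl,-a,archive  # force archive libraries
```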
4.0 Results
4.1 Result Summary
Across this variety of benchmarks, the HP-UX 9.0 compilers average better than
a 20% improvement from Level One (+O1) optimization alone. Turning on Level Two
generates an average improvement of almost 60%. The use of archived libraries
rather than shared libraries offers an additional 8% performance. This gives a
68% total improvement over the defaults, all of which is available without
touching any source code.
Table 1: Relative Speedups with Various Levels of Optimization
    SPECint92   none   +O1    +O2    +O3    +O3 archived
    008         1.00   1.26   1.59   1.59   1.66
    022         1.00   1.25   1.52   1.54   1.57
    023         1.00   1.38   2.62   2.62   2.77
    026         1.00   1.18   1.40   1.42   1.46
    072         1.00   1.05   1.11   1.11   1.30
    085         1.00   1.17   1.24   1.24   1.34
    Average     1.00   1.21   1.58   1.59   1.68
4.2 Individual Benchmarks
The individual results show a fair amount of variation, reflecting the variety
in the six members of the CINT92 suite. SPEC recommends that people not rely
upon the summary metrics, but rather determine which benchmarks match their
interests and look carefully at those. With that in mind, the following
short descriptions might be helpful in reviewing these results.
o Espresso and Eqntott are both from scientific applications and should
show the most improvement from compiler optimizations.
o Li, as a Lisp interpreter recursing and backtracking through the
N-Queens problem, would showcase the procedure calling overhead.
o Compress and SC are much like many UN*X commands.
o GCC represents large system applications.
4.2.1 085.gcc
The code in the 085.gcc benchmark is version 1.35 of the GNU C Compiler. The
benchmark measures the time it takes a system to use this compiler to generate
several executables for the old Sun-3 workstations.
The GNU compiler is over 50,000 lines of C code (not counting comments,
blank space, etc.), much of which is fairly representative of large system
applications. Because of its size and its behavior, which is typical of large
software projects, this benchmark has been used by many as a predictor for
performance of both kernel and large application behavior.
There was a 17% improvement going from none to just Level One optimization.
There was almost a 25% improvement going from none to Level Two optimization.
Additionally, the use of archive libraries rather than shared libraries
obtained another 10% improvement. All this in typical system-level code, which
is supposed to render optimizers ineffective.
4.2.2 072.sc
This is the public domain UN*X spreadsheet program sc(1) run against three
different inputs: a mortgage cost calculation, a small budget calculation, and
a SPECmark89 result calculation. The program sc(1) makes great use of the
curses(3) package to do the screen handling and give a Lotus 1-2-3 look. The
benchmark enforces a common vt220 terminal definition for the curses(3)
handling no matter what the underlying system, so that all systems are doing
the same work.
This benchmark was the one in the suite which showed the least improvement, but
then 072.sc is the benchmark which spends a very large fraction of its time in
library routines: 80%. All of that time spent in the libraries is unaffected by
any compiler optimizations. Even with spending only 1/5th of its time in the
code that the compilers can have an effect upon, Level One optimization got a
gain of 5%, and Level Two brought the advantage up over 10%. This implies that
the optimizer made differences of 25 and 50% in the code that it operated upon.
However, with all the calls to the library routines, this is the benchmark
which gains the most from using archived instead of shared libraries: almost
20%. Thus, it is very important that programs which make a lot of calls to
library routines be linked from the archived libraries rather than from the
shared libraries. Taken together, even this benchmark can run 30% faster than
with the defaults.
4.2.3 026.compress
The 026.compress benchmark is from a public-domain version of the UN*X
compress(1) utility using the Lempel-Ziv algorithm. This benchmark reads its
standard input, which is a 1 MB file of random text, and writes a compressed
form to the standard output; then the compressed file is fed back through
standard IO to be uncompressed. The code loops through reading in a block of
data, computing its compressed form, and then writing that out. This makes for
a typical UN*X filter, though perhaps with a bit higher ratio of compute to IO.
Perhaps owing to the fair bit of computation per IO buffer, this benchmark
gains almost 20% with just Level One optimization. This performance gain is
doubled when Level Two optimization is enabled. Here again, the amount of
computation per buffer, and in particular the looping nature of that
computation, allows the optimizer to work well. Not surprisingly, for a
benchmark which makes comparatively few library calls, the difference between
shared and archived libraries was not significant. But again, the total
increase was over 45% better than the default settings.
4.2.4 023.eqntott
One of the most analytical of the CINT92 suite, 023.eqntott is based on a tool
from the area of logic design which translates boolean equations into truth
tables. This integer benchmark is probably the most like the classic floating
point benchmarks: lots of activity inside loops, not a lot of unexpected
branches. Therefore, it is the most likely benchmark to highlight the effect
of the optimizer's capabilities.
As expected, this benchmark shows considerable improvement even from just Level
One optimization: 38%. Then, like most scientific applications, Level Two
optimization has a great effect; in this case it approaches a 3-fold increase
in performance. And again, even after a considerable speed-up, the use of
archived libraries adds a noticeable gain.
4.2.5 022.li
This is the benchmark in the CINT92 suite which most exercises the procedure
calling convention, 022.li is a Lisp interpreter running a code which attempts
to solve the 9-Queens problem with a recursive backtracking algorithm.
Again, on this benchmark, Level One gains a 25% performance increase. Level
Two optimization doubles this gain for a total of over a 50% increase. Once
again, just using the default settings leaves out a considerable amount of
performance.
4.2.6 008.espresso
This benchmark is taken from another logic design application; in this case the
application generates and optimizes Programmable Logic Arrays. This is another
of the more computational benchmarks.
In this case, even Level One gets over 25% improvement. And, one more time, the
use of Level Two does better than double the advantage, to almost 60% better
than the basic compilations. On top of that, linking with archived rather than
shared libraries gains several more percentage points, to where this benchmark
measures 66% better fully optimized than it measured under default conditions.
4.3 8.02 Results
There were a lot of enhancements made to the HP-UX 9.0 compilers, but even
so, with HP-UX 8.0 it was still worth a fair bit to turn on the optimizer. As
detailed in the table, Level One gained over 15%, and Level Two gains that
much again, for a total possible improvement of over 30%.
Table 2: HP-UX 8.02: Relative Speedups of Various Levels of Optimization
    SPECint92   none   +O1    +O2    +O3    +O3 archived
    008         1.00   1.23   1.41   1.41   1.46
    022         1.00   1.16   1.31   1.31   1.30
    023         1.00   1.34   1.74   1.73   1.81
    026         1.00   1.08   1.24   1.24   1.25
    072         1.00   1.03   1.06   1.06   1.15
    085         1.00   1.14   1.18   1.18   1.23
    Average     1.00   1.16   1.32   1.32   1.37
Most important, however, are the differences in the compilers between 8.02
and 9.0. Starting in 9.0, the compilers use long branches exclusively
unless the optimizer is activated. This makes the work of the linker much
easier, resulting in much faster link times. The effect of this is that code
compiled without optimization will run almost 10% slower under 9.0 than under
8.02, without even considering what the kernel and the rest of the system do
to the performance.
Table 3: 9.0 versus 8.02: Relative Speeds at Various Levels of Optimization
    SPECint92   none   +O1    +O2    +O3    +O3 archived
    008         0.75   0.76   0.84   0.84   0.85
    022         0.89   0.96   1.04   1.05   1.07
    023         0.82   0.84   1.24   1.24   1.26
    026         0.91   1.00   1.03   1.05   1.07
    072         1.17   1.19   1.22   1.22   1.32
    085         0.91   0.94   0.96   0.96   1.00
    Average     0.91   0.95   1.05   1.06   1.09
5.0 Conclusion
Compiler optimization technology unlocks the performance potential of the
PA-RISC architecture. The RISC philosophy has fundamentally changed the role of
the compiler. As RISC moves to strike a balance between hardware and software
that exploits the best of each technology, the resulting simple,
high-performance instruction set enables the compiler to apply optimizations
that dramatically improve performance. The effectiveness of RISC depends on
the compiler's ability to create the optimal instruction sequence by
appropriately rearranging the program steps. Without these optimizations, many
applications will execute at a performance level far below their potential.